In this document only the differences to the original workflow are described. A more detailed description of the workflow is found in the original tutorial description. The code can be found here

Preperation

The workflow of the Seurat tutorial is adapted to the new dataset:

1k Brain Cells from an E18 Mouse

The filtered data set is unpacked into the data folder of the project.

Loading libraries

library(dplyr)
library(Seurat)
library(patchwork)

Setup the seurat Object

The path has to be changed to the folder with the new data.

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/neurons_900_filtered_gene_bc_matrices/filtered_gene_bc_matrices/mm10/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "neurons_900", min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat 
## 14144 features across 917 samples within 1 assay 
## Active assay: RNA (14144 features, 0 variable features)

Preprocessing and Quality control

In the new dataset the mitochondrial genes start with “mt-”. The lower case is not recognized by the original pattern. The code was adapted to find the desired genes in the new dataset.

# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-|^mt-") # changing pattern

Visualization of QC Metrices

# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

Visualization of feature-feature relationships by FeatureScatter

# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

Subsets with specific filter criteria can also be defined

The new dataset has a higher number of features. Therefore the filter was redefined according to the plots above.

pbmc <- subset(pbmc, subset = nFeature_RNA > 1000 & nFeature_RNA < 7000 & percent.mt < 10) # adapted values according to the plots

Normalization of the data

The normalization is performed using the standard parameters.

pbmc <- NormalizeData(pbmc)

Feature selection for the identification of highly variable features

The number of selected features remains unchanged for the new data

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2

Scaling of the data

We use the standard options here as we just need the selected features afterwards. There is no need to scale the not-selected features.

pbmc <- ScaleData(pbmc)

Linear dimension reduction

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

Visualizing cells and features that define PCA

# Examine and visualize PCA results in a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1 
## Positive:  Dbi, Phgdh, Hmgb2, Ddah1, Fabp7 
## Negative:  Tubb3, Tmsb10, Stmn3, Neurod2, Sox11 
## PC_ 2 
## Positive:  Pantr1, Ran, Hmgb1, Meis2, Hmgn2 
## Negative:  Igfbp7, Col4a2, Gng11, Bgn, Emcn 
## PC_ 3 
## Positive:  Aldoc, Ndrg2, Mt3, Mt1, Mt2 
## Negative:  Top2a, Spc25, Birc5, Cdca2, Kif22 
## PC_ 4 
## Positive:  Rpl18a, Eomes, Mfap4, Mfng, Sstr2 
## Negative:  Ly6h, Tubb2a, Islr2, Stmn2, Gap43 
## PC_ 5 
## Positive:  Ptprz1, Pantr1, Dab1, Meis2, Clmp 
## Negative:  Reln, Trp73, Nhlh2, Cacna2d2, Lhx5
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

DimPlot(pbmc, reduction = "pca")

We observe that PC1 is evenly distributed while PC2 is influenced by a low number of values. ## Visualization with DimHeatmap() As in the original Tutorial we use the heat maps to select the PCs containing the most relevant information for our dataset. We observe that the separation is not optimal and that PC1 has the biggest impact.

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)

## Determine the dimensionality of a dataset We use the JackStraw plot as well as the EllbowPlot the determine the number of PCs used for further analysis. ### Jack Straw Plot

# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)

Comparison of the distribution of p-values for each PC with a unit distribution. Significant PCs show a strong enrichment of features with low p-values (solid curve above the dashed line).

JackStrawPlot(pbmc, dims = 1:20)

### Elbow plot

ElbowPlot(pbmc, ndims = 30)

### Select PCs The Jack Straw Plot is not conclusive. The ellbow plot suggest that 15 PCs could be a good choice. We will later try different amount of PCs to check if the results are robust to these changes.

used_dims = 15

Cluster the cells

We use the standard graph based clustering approach as we do not know how many cluster we can expect.

pbmc <- FindNeighbors(pbmc, dims = 1:used_dims)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 898
## Number of edges: 26689
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8048
## Number of communities: 7
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACCTGGTCTCGTTC-1 AAACGGGAGCCACGTC-1 AAACGGGAGCGAGAAA-1 AAACGGGCACACCGAC-1 
##                  1                  0                  0                  4 
## AAACGGGTCGCCAGCA-1 
##                  1 
## Levels: 0 1 2 3 4 5 6

Non-linear dimensional reduction(UMAP)

We use UMAP to visualize the clusters.

pbmc <- RunUMAP(pbmc, dims = 1:used_dims)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")

The name of the saved Seurat object is changed to seperate the different datasets.

### output ordner angelegt
saveRDS(pbmc, file = "../output/pbmc_neurons_900.rds")

Search for differentially expressed traits (cluster biomarkers)

We search for the 2 genes with the highest fold change for each cluster. Later we use them to mark the clusters.

# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
biomarkers <- pbmc.markers %>%
    group_by(cluster) %>%
    slice_max(n = 2, order_by = avg_log2FC)
print(biomarkers)
## # A tibble: 14 × 7
## # Groups:   cluster [7]
##       p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene  
##       <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr> 
##  1 1.40e-34       1.57 0.692 0.294  1.99e-30 0       Pcp4  
##  2 2.05e-29       1.43 0.562 0.206  2.90e-25 0       Eomes 
##  3 4.50e-77       2.17 0.85  0.242  6.36e-73 1       Opcml 
##  4 3.32e-59       1.75 0.662 0.145  4.70e-55 1       Snca  
##  5 1.98e-50       1.21 0.974 0.741  2.80e-46 2       Gpm6a 
##  6 9.41e-46       1.21 0.907 0.521  1.33e-41 2       Myt1l 
##  7 7.28e-54       4.31 0.934 0.647  1.03e-49 3       Fabp7 
##  8 4.08e-80       3.52 0.934 0.299  5.77e-76 3       Mt3   
##  9 3.23e-45       2.69 0.84  0.179  4.57e-41 4       Ube2c 
## 10 6.31e-40       2.60 1     0.554  8.92e-36 4       Hmgb2 
## 11 1.49e-31       2.79 0.34  0.021  2.11e-27 5       Reln  
## 12 3.68e-25       2.28 0.321 0.026  5.20e-21 5       Snhg11
## 13 2.44e-71       2.80 0.92  0.044  3.46e-67 6       Nrxn3 
## 14 5.54e-14       2.54 0.84  0.321  7.83e-10 6       Mef2c

Visualization of the marker expression

We use the VinPlot to show the distribution of the genes with the highest positive and negative loadings of PC1. We see the different expressions in the different clusters and can, for example, see that Phgdh is almost exlusivly expressted in the clusters 3 and 4.

VlnPlot(pbmc, features = c("Dbi", "Phgdh"))

# you can plot raw counts as well
VlnPlot(pbmc, features = c("Tubb3", "Tmsb10"), slot = "counts", log = TRUE)

Nest we check our marker genes for the different clusters. We slice the biomarkers data frame to just view the genes with the highest fold change.

FeaturePlot(pbmc, features = biomarkers$gene[c(TRUE, FALSE)])

DoHeatmap can be used to generate an expression heatmap for cells and features. The top 20 markers for each cluster are plotted here.

pbmc.markers %>%
    group_by(cluster) %>%
    top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

Assigning marker genes to the clusters

In this step we do not have enough background information to link our marker genes with cell types. Therefore we mark the clusters with the best marker gene.

new.cluster.ids <- biomarkers$gene[c(TRUE, FALSE)]
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

saveRDS(pbmc, file = "../output/neurons_900_final.rds")